Ashley Decker

Sport and Big Data

Country Success and Dominance in the Olympics

Introduction:

The Olympics are an international sporting extravaganza in which countries can showcase their athletic skill and stir national pride. The modern Olympic Games have been running in some form for the past 120 years with few cancellations and exceptions (for more historical background on the Olympics see my previous work in a timeline format here). The ceremony and athletic showmanship of the Games are entertaining, and over those 120 years a great deal of interesting data has been collected about the athletes, sports, and countries involved. Have you ever wondered, while watching the Olympic Games, why certain countries always seem to dominate certain sports? Think of Norway and cross-country skiing, the Netherlands and speed skating, or the United States and swimming (thank you, Michael Phelps). This study analyzes 120 years of Olympic Games data in conjunction with GDP data to investigate the reasons for national dominance in Olympic sport.

Research Questions:

Loading the data:

The five datasets from Kaggle are athletes_events.csv, noc_regions.csv, olym.csv, gdp_data.csv, and UN_Gdp_data.

*Note: to run this notebook locally, download the notebook and the CSV files from the links above into one folder and name the dataframes accordingly.

- The athletes dataset contains the following columns:

- The NOC dataset contains the following columns:

- The Hosts dataset contains the following columns:

- The UN GDP dataset contains the following columns:

- The WorldBank GDP dataset contains the following columns:

Merging the Olympics data:

I merged the athletes and NOC datasets from Kaggle using a left join so that every row in the athletes dataset keeps its team and picks up the matching NOC region.
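As a rough illustration of that left join, here is a minimal sketch on toy data. The column names (`NOC`, `region`, `notes`) and the toy rows are assumptions made for the example, not values taken from the Kaggle files.

```python
import pandas as pd

# Toy stand-ins for the athletes and NOC datasets (illustrative values only).
athletes = pd.DataFrame({
    "Name": ["Athlete A", "Athlete B", "Athlete C"],
    "NOC": ["USA", "NOR", "XXX"],  # "XXX" is a made-up code with no region entry
    "Medal": ["Gold", None, "Bronze"],
})
noc = pd.DataFrame({
    "NOC": ["USA", "NOR"],
    "region": ["USA", "Norway"],
    "notes": [None, None],
})

# A left join keeps every athlete row, even when its NOC code has no match.
data = athletes.merge(noc, on="NOC", how="left", indicator=True)

# Rows flagged "left_only" are NOC codes that found no region.
unmatched = data.loc[data["_merge"] == "left_only", "NOC"].unique().tolist()
print(unmatched)
```

Checking the unmatched codes right after the merge is a quick way to see which teams will end up with a missing region.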

Dealing with missing and inconsistent data:

There are missing values in the Age, Height, Weight, Medal, region, and notes columns. To make the analysis easier, I will drop duplicate rows and impute missing values.
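The cleaning step might look something like the sketch below. Filling a missing `Medal` with 0 matches the code later in this notebook; filling numeric columns with the column median is only an assumption made for the example, since the imputation rule is not spelled out here.

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative values only); the first two rows are exact duplicates.
df = pd.DataFrame({
    "Name": ["A", "A", "B", "C"],
    "Age": [24.0, 24.0, np.nan, 30.0],
    "Height": [180.0, 180.0, 170.0, np.nan],
    "Medal": pd.Series(["Gold", "Gold", None, None], dtype=object),
})

# Drop exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# No medal recorded means no medal won.
df["Medal"] = df["Medal"].fillna(0)

# ASSUMPTION: impute numeric gaps with the column median.
for col in ["Age", "Height"]:
    df[col] = df[col].fillna(df[col].median())

print(df)
```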

# list each (Year, City) pair once.
# where there are two different cities listed for a year,
# that indicates there was a summer and winter olympics held in the same year.
# the summer and winter olympics didn't begin to stagger until the 1990s.
data[['Year', 'City']].drop_duplicates().sort_values('Year')

s_names = data.loc[data['Season'] == 'Summer']
w_names = data.loc[data['Season'] == 'Winter']
#print(sorted(list(w_names.Country.unique())))
#print(sorted(list(w_names.City.unique())))
#print(sorted(list(s_names.Country.unique())))
#print(sorted(list(s_names.City.unique())))

# create a dictionary for all the summer olympic countries
# and corresponding cities in the dataset
summer_country_dict = {
    'Athina': 'Greece', 'Paris': 'France', 'St. Louis': 'USA', 'London': 'UK',
    'Stockholm': 'Sweden', 'Antwerpen': 'Belgium', 'Amsterdam': 'Netherlands',
    'Los Angeles': 'USA', 'Berlin': 'Germany', 'Helsinki': 'Finland',
    'Melbourne': 'Australia', 'Roma': 'Italy', 'Tokyo': 'Japan',
    'Mexico City': 'Mexico', 'Munich': 'Germany', 'Montreal': 'Canada',
    'Moskva': 'Russia', 'Seoul': 'South Korea', 'Barcelona': 'Spain',
    'Atlanta': 'USA', 'Sydney': 'Australia', 'Beijing': 'China',
    'Rio de Janeiro': 'Brazil'}

# create a dictionary for all the winter olympic countries
# and corresponding cities in the dataset
winter_country_dict = {
    'Chamonix': 'France', 'Lake Placid': 'USA', 'Sankt Moritz': 'Switzerland',
    'Garmisch-Partenkirchen': 'Germany', 'Oslo': 'Norway', 'Squaw Valley': 'USA',
    'Sarajevo': 'Yugoslavia', 'Grenoble': 'France', 'Torino': 'Italy',
    'Nagano': 'Japan', 'Sapporo': 'Japan', "Cortina d'Ampezzo": 'Italy',
    'Albertville': 'France', 'Calgary': 'Canada', 'Sochi': 'Russia',
    'Vancouver': 'Canada', 'Lillehammer': 'Norway', 'Salt Lake City': 'USA',
    'Innsbruck': 'Austria'}

# subset only the summer olympic data
# (.copy() so the new column is added to an independent frame, not a view of data)
summer = data.loc[data['Season'] == 'Summer'].copy()
# create a Host_Country column that maps the
# country to the city using the summer country dictionary
summer['Host_Country'] = summer['City'].map(summer_country_dict)

# subset only the winter olympic data
winter = data.loc[data['Season'] == 'Winter'].copy()
# create a Host_Country column that maps the
# country to the city using the winter country dictionary
winter['Host_Country'] = winter['City'].map(winter_country_dict)

Visualization 1a:

Countries with the most medals for summer olympics

To count medals for countries, I want to count a team win as 1 medal (rather than the sum of all medals won by each individual athlete in a team event) to account for variation in size of teams for different sports.

I first explored which countries were overall most dominant in the Olympics according to their all time medal counts. This figure displays the top ten medal winning countries of all time in the Summer Olympics.

import numpy as np
import pandas as pd

# exclude rows where athlete did not win a medal
medals = summer.loc[summer['Medal'] != 0].copy()
medals['Medal_Won'] = 1

# create a pivot table summing the medals won
team = pd.pivot_table(medals, index=['Country', 'Year', 'Event'], columns='Medal',
                      values='Medal_Won', aggfunc='sum', fill_value=0).reset_index()

# an event where one country collects more than one gold is treated as a team event
team = team.loc[team['Gold'] > 1, :]
team_sports = team['Event'].unique()

# take out individual events that pass the filter above
# (e.g. because a tie awarded more than one gold)
team_sports = list(set(team_sports) - set(["Swimming Women's 100 metres Freestyle",
                                           "Swimming Men's 50 metres Freestyle",
                                           "Gymnastics Women's Balance Beam",
                                           "Gymnastics Men's Horizontal Bar"]))

# flag every medal row as a team event or an individual event
medals['Team_Event'] = np.where(medals['Event'].isin(team_sports), 1, 0)
medals['Individual_Event'] = np.where(medals['Team_Event'], 0, 1)

# tally medals for individual and team events
medals_tally = medals.groupby(['Year', 'NOC', 'Country', 'Sport', 'Sex', 'Event', 'Medal'])[
    ['Medal_Won', 'Team_Event', 'Individual_Event']].agg('sum').reset_index()

# dividing by the number of rows in each group makes a team medal count as 1
medals_tally['Medal_Count'] = medals_tally['Medal_Won'] / (
    medals_tally['Team_Event'] + medals_tally['Individual_Event'])
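A tally like `medals_tally` can then be rolled up into all-time totals per country. The sketch below uses a small toy frame with the same `Country` and `Medal_Count` columns; the rows are illustrative, and the plotting step is left out.

```python
import pandas as pd

# Toy stand-in for medals_tally (illustrative rows only).
medals_tally = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "Russia", "Russia", "Germany"],
    "Year": [2008, 2012, 2016, 2008, 2012, 2012],
    "Medal_Count": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
})

# Sum the team-adjusted medal counts per country and keep the ten largest.
top10 = medals_tally.groupby("Country")["Medal_Count"].sum().nlargest(10)
print(top10)
```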

Visualization 1b:

Countries' medal counts by medal type

From the previous graph, we can see that the top 10 countries by medals won over the 120 years of Summer Olympics are as follows:

  1. USA
  2. Russia
  3. Germany
  4. UK
  5. France
  6. Italy
  7. China
  8. Australia
  9. Sweden
  10. Hungary

Next, I look at these same countries' medal counts broken down into gold, silver, and bronze. The countries are sorted from most to fewest gold medals. Again, the U.S. and Russia have the most gold medals. What's more, both countries have more golds than any other medal type. Italy has more total medals than China, but China has more gold medals than Italy.
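A breakdown by medal type can be sketched with a pivot table. The toy rows below are illustrative only; they are chosen so that one country leads on total medals while the other leads on golds, mirroring the Italy and China comparison.

```python
import pandas as pd

# Toy tally (illustrative rows only), one row per medal after the team adjustment.
tally = pd.DataFrame({
    "Country": ["Italy", "Italy", "Italy", "China", "China"],
    "Medal": ["Gold", "Silver", "Bronze", "Gold", "Gold"],
    "Medal_Count": [1, 1, 1, 1, 1],
})

# One row per country, one column per medal type, sorted by gold medals.
by_type = (
    tally.pivot_table(index="Country", columns="Medal", values="Medal_Count",
                      aggfunc="sum", fill_value=0)
    .reindex(columns=["Gold", "Silver", "Bronze"], fill_value=0)
    .sort_values("Gold", ascending=False)
)
by_type["Total"] = by_type[["Gold", "Silver", "Bronze"]].sum(axis=1)
print(by_type)
```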

Visualization 2a:

Dominant Countries in Summer Sports

Countries that appear to have “dynasties” in certain Olympic sports include Russia in weightlifting, the USA in basketball, and China in table tennis (Gehrz). I explored these three sports and countries further in the following visual analysis.

Russia and Weightlifting

The pie chart below shows that Russia lays claim to over 35% of all weightlifting Olympic medals.

Visualization 2b:

USA and Basketball

This next pie chart shows that the USA has won nearly 44% of all basketball Olympic medals.

Visualization 2c:

China and Table Tennis

The last pie chart demonstrates that China has won a whopping 62.8% of all table tennis medals in the Olympics.
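The percentages behind pie charts like these come down to each country's share of all medals in one sport. Here is a minimal sketch; `medal_share` is a hypothetical helper written for this example, and the toy rows are illustrative rather than real results.

```python
import pandas as pd

def medal_share(tally: pd.DataFrame, sport: str) -> pd.Series:
    """Return each country's percentage of all medals awarded in one sport."""
    sport_rows = tally.loc[tally["Sport"] == sport]
    totals = sport_rows.groupby("Country")["Medal_Count"].sum()
    return (100 * totals / totals.sum()).sort_values(ascending=False)

# Toy tally (illustrative rows only).
tally = pd.DataFrame({
    "Sport": ["Table Tennis"] * 4 + ["Basketball"],
    "Country": ["China", "China", "China", "South Korea", "USA"],
    "Medal_Count": [1, 1, 1, 1, 1],
})

share = medal_share(tally, "Table Tennis")
print(share)
```

The resulting series can be passed straight to a plotting call such as `share.plot.pie()`.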

How Does GDP Affect Olympic Success?

Surely, dominating the medal counts can't be attributed entirely to the supremacy of the athletes or a cultural affinity for the sport. Does economics have anything to do with Olympic success? Next, I will explore whether GDP is correlated with higher medal counts.

# Dummy code medals: 1 if the athlete won any medal, 0 otherwise
data_with_gdp['Medal'] = data_with_gdp['Medal'].fillna(0)
data_with_gdp['Medal_win'] = data_with_gdp['Medal'].replace({'Gold': 1, 'Silver': 1, 'Bronze': 1})
data_with_gdp.head()

# normalize GDP value: divide by the largest GDP recorded in the same year
data_with_gdp['Value'] = data_with_gdp['Value'] / data_with_gdp.groupby('Year')['Value'].transform('max')

Visualization 3a:

Summer Olympic Medal Count vs GDP

There does seem to be a slight positive correlation between GDP and winning more medals. There are a few outliers, however. Countries like Germany, the USA, Russia, and China appear to inflate the slope of medals won against GDP. Conversely, countries like Switzerland, Norway, and Sweden appear to bring that slope down, as they win fewer medals than expected given their high GDPs. Countries like Luxembourg and Monaco have very high GDPs per capita, as they are very wealthy nations, but have few to no medals.

I suspect that the countries mentioned that fall below the average medals won for their GDP (Switzerland, Norway, and Sweden) have fewer medals because they tend to be more dominant at the Winter Olympics and do not necessarily excel at Summer Olympic sports.

I also believe that small but wealthy countries such as Luxembourg and Monaco do not have high medal counts because of their size. They likely do not send many athletes to the Games and therefore have few chances to win medals.

Just as GDP or other indicators of economic wealth may positively influence medal counts, population and the number of athletes sent likely matter too.

In a 2000 study, "Who Wins the Olympic Games: Economic Resources and Medal Totals", Bernard and Busse find that both population and GDP have positive and significant correlations with countries' medal counts at the Olympics. The results of their study show that countries "having resources to invest in human ability is important in producing success," (Bernard, Busse).
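The GDP relationship discussed in this section can be quantified with a correlation coefficient. The sketch below assumes one row per country with a normalized GDP column (`Value`, as in the code above) and a medal total (`Medals`, a hypothetical column name); the numbers are toy values, not results from the dataset.

```python
import pandas as pd

# Toy country-level data (illustrative values only).
country_totals = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Value": [1.0, 0.6, 0.4, 0.2, 0.1],   # GDP normalized to the yearly maximum
    "Medals": [50, 35, 30, 5, 2],
})

# Pearson correlation measures the linear relationship.
pearson_r = country_totals["Value"].corr(country_totals["Medals"])

# Spearman correlation is Pearson on ranks, so it is less sensitive to outliers.
spearman_rho = country_totals["Value"].rank().corr(country_totals["Medals"].rank())

print(round(pearson_r, 3), round(spearman_rho, 3))
```

Reporting a rank-based coefficient alongside Pearson's is one way to see how much a handful of very large economies drives the slope.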

Visualization 3b:

Olympic medal counts over time

To see whether some of these outliers and top-performing countries have been dominant over time, I explored their medal counts since 1984 (because China's territory and NOC designation have changed since this period).

This visualization shows that the USA has been dominant throughout this period, although it excels more at the Summer Games than at the Winter ones. As suspected, Switzerland (and likely the other countries that fall below the average medal count for their GDP) wins more medals at the Winter Olympics than at the Summer ones.

I can also see that the U.S. did extraordinarily well in 1984, and China did extraordinarily well in 2008. Interestingly, the U.S. hosted the 1984 Olympics and China hosted the 2008 Olympics. Could hosting the Games give the host country an advantage?
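A Summer versus Winter comparison over time can be built by grouping the tally by country, season, and year. This is a minimal sketch on toy rows (illustrative only), assuming `Season`, `Year`, and `Medal_Count` columns.

```python
import pandas as pd

# Toy tally (illustrative rows only).
tally = pd.DataFrame({
    "Country": ["Switzerland"] * 3 + ["USA"] * 3,
    "Year": [2012, 2014, 2014, 2012, 2012, 2014],
    "Season": ["Summer", "Winter", "Winter", "Summer", "Summer", "Winter"],
    "Medal_Count": [1, 1, 1, 1, 1, 1],
})

# Keep Games from 1984 onward, then count medals per country, year, and season.
recent = tally.loc[tally["Year"] >= 1984]
by_games = (
    recent.groupby(["Country", "Season", "Year"])["Medal_Count"].sum()
    .unstack("Season", fill_value=0)
)

# Collapse the years to compare each country's Summer and Winter totals.
season_totals = by_games.groupby("Country").sum()
print(by_games)
print(season_totals)
```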

How Does Hosting and Number of Athletes Affect Olympic Success?

Visualization 4a:

Hosting and # of Athletes Case Study: USA vs China

Medal Count (Summer Olympics 1984-2016)

The previous visualization seems to indicate that hosting the Olympics may positively contribute to a country's medal count. This next visualization appears to corroborate that intuition. In 1984, when the U.S. hosted the Summer Olympics in Los Angeles, it held a commanding lead in medals won. In 2008, when China hosted the Summer Olympics in Beijing, it enjoyed extraordinary success over its opponents and won more medals than it ever had before. This suggests a host advantage.
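One way to check for a host effect is to flag the Games where a country's `Country` matches the `Host_Country` column created earlier and compare average medal counts. The sketch below uses toy medal totals (illustrative only, not the real counts); `Medals` and `Role` are hypothetical column names.

```python
import pandas as pd

# Toy per-Games totals (illustrative values only).
games = pd.DataFrame({
    "Year": [1984, 1984, 2008, 2008],
    "Country": ["USA", "China", "USA", "China"],
    "Host_Country": ["USA", "USA", "China", "China"],
    "Medals": [3, 1, 2, 3],
})

# Label each row by whether the country was hosting that year.
is_host = games["Country"] == games["Host_Country"]
games["Role"] = is_host.map({True: "host", False: "visitor"})

# Average medals per country as host versus as visitor.
avg = games.groupby(["Country", "Role"])["Medals"].mean().unstack("Role")
avg["Host_Bump"] = avg["host"] - avg["visitor"]
print(avg)
```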

Visualization 4b:

Hosting and # of Athletes Case Study: USA vs China

Medal Count Per Athlete Sent (Summer Olympics 1984-2016)

Why does a country do better when it hosts the Olympics? Typically, a country is able to send many more athletes when the Games are held at home. As noted earlier, the more athletes a country sends to the Olympics, the more medals it tends to win.

In "Coming to Play or Coming to Win: Participation and Success at the Olympic Games", Johnson and Ali found that host nations win 24.87 more medals than non-host nations on average. They also found that neighbors of host nations tend to do well too. This host advantage is attributed to lower transportation costs and climatic advantages. The nations that are most likely to be hosts are also large, wealthy nations (high population and high GDP), which we have already established have a medal count advantage. Bernard and Busse attribute host advantage to less expensive attendance costs, facilities, and the influence of the audience on judging (especially in sports that involve subjective judgement, such as gymnastics).

Other studies suggest that the notion of a host-country medal advantage is false. The study "Hosting the Olympic Games: An Overstated Advantage in Sports History" finds that host countries' advantage can be attributed mainly to larger contingents of athletes. The qualification criteria are often relaxed for host nations, making it more feasible to field larger numbers of athletes. This study showed that, on average, in the Summer Olympic Games a host country’s team fields roughly 162.2 more athletes than in the previous Summer Games. The number of athletes is what results in the larger medal haul.

When I control for the number of athletes sent to the Games, there is less of a gap between the U.S. and China, and less variation between different Olympic Games. The U.S. and China both sent many more athletes than normal to the 1984 and 2008 Olympics respectively. Therefore, their medals-per-athlete scores, which show the efficiency with which each country won medals, are closer together than their raw medal counts. In terms of medals per athlete, China did best in 2012, not in the year it hosted. The U.S. also hosted the 1996 Olympics but had a higher medals-per-athlete score in 6 of the 8 other Games played between 1984 and 2016. While countries enjoy an advantage from hosting the Games, much of that advantage is probably attributable to sending more athletes to those Games.
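A medals-per-athlete score can be sketched by counting distinct athletes and summing medals for each country and Games. The `ID` column (one identifier per athlete) and the toy rows are assumptions for the example; in the full analysis the medal sum would come from the team-adjusted counts.

```python
import pandas as pd

# Toy entry-level rows (illustrative only): one row per athlete per event.
entries = pd.DataFrame({
    "Year": [2008] * 4 + [2012] * 3,
    "Country": ["China"] * 7,
    "ID": [1, 1, 2, 3, 1, 2, 2],          # hypothetical athlete identifiers
    "Medal_Won": [1, 0, 1, 0, 1, 1, 0],
})

# Count distinct athletes sent and medals won per country and Games.
per_games = (
    entries.groupby(["Country", "Year"])
    .agg(Athletes=("ID", "nunique"), Medals=("Medal_Won", "sum"))
    .reset_index()
)

# Efficiency: medals won per athlete sent.
per_games["Medals_per_Athlete"] = per_games["Medals"] / per_games["Athletes"]
print(per_games)
```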

Sources used for this notebook: